{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  贝叶斯拼写检查器 ###"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re, collections\n",
    " \n",
    "def words(text): return re.findall('[a-z]+', text.lower()) \n",
    " \n",
    "def train(features):\n",
    "    model = collections.defaultdict(lambda: 1)\n",
    "    for f in features:\n",
    "        model[f] += 1\n",
    "    return model\n",
    " \n",
    "NWORDS = train(words(open('big.txt').read()))\n",
    " \n",
    "alphabet = 'abcdefghijklmnopqrstuvwxyz'\n",
    " \n",
    "def edits1(word):\n",
    "    n = len(word)\n",
    "    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion\n",
    "               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition\n",
    "               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration\n",
    "               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion\n",
    " \n",
    "def known_edits2(word):\n",
    "    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)\n",
    " \n",
    "def known(words): return set(w for w in words if w in NWORDS)\n",
    " \n",
    "def correct(word):\n",
    "    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n",
    "    return max(candidates, key=lambda w: NWORDS[w])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'china'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#appl #appla #learw #tess #morw\n",
    "correct('chine')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 求解：argmaxc P(c|w) -> argmaxc P(w|c) P(c) / P(w) ###\n",
    "\n",
    "* P(c), 文章中出现一个正确拼写词 c 的概率, 也就是说, 在英语文章中, c 出现的概率有多大\n",
    "* P(w|c), 在用户想键入 c 的情况下敲成 w 的概率. 因为这个是代表用户会以多大的概率把 c 敲错成 w\n",
    "* argmaxc, 用来枚举所有可能的 c 并且选取概率最大的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# 把语料中的单词全部抽取出来, 转成小写, 并且去除单词中间的特殊符号\n",
    "def words(text): return re.findall('[a-z]+', text.lower()) \n",
    " \n",
    "def train(features):\n",
    "    model = collections.defaultdict(lambda: 1)\n",
    "    for f in features:\n",
    "        model[f] += 1\n",
    "    return model\n",
    " \n",
    "NWORDS = train(words(open('big.txt').read()))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "要是遇到我们从来没有过见过的新词怎么办. 假如说一个词拼写完全正确, 但是语料库中没有包含这个词, 从而这个词也永远不会出现在训练集中. 于是, 我们就要返回出现这个词的概率是0. 这个情况不太妙, 因为概率为0这个代表了这个事件绝对不可能发生, 而在我们的概率模型中, 我们期望用一个很小的概率来代表这种情况. lambda: 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "defaultdict(<function __main__.train.<locals>.<lambda>>,\n",
       "            {'counterorders': 2,\n",
       "             'ureters': 3,\n",
       "             'displeasure': 9,\n",
       "             'omitted': 10,\n",
       "             'sparrow': 5,\n",
       "             'tubercle': 66,\n",
       "             'curse': 7,\n",
       "             'pauncefote': 2,\n",
       "             'updated': 5,\n",
       "             'gloomier': 4,\n",
       "             'foremost': 17,\n",
       "             'wabash': 2,\n",
       "             'anarchists': 4,\n",
       "             'intermediacy': 2,\n",
       "             'threadbare': 2,\n",
       "             'endeavouring': 9,\n",
       "             'freeholders': 11,\n",
       "             'irreproachably': 3,\n",
       "             'ignominious': 3,\n",
       "             'illuminated': 9,\n",
       "             'galitsyn': 2,\n",
       "             'struthers': 3,\n",
       "             'shuya': 2,\n",
       "             'futile': 16,\n",
       "             'each': 412,\n",
       "             'district': 38,\n",
       "             'acquiesced': 2,\n",
       "             'staircase': 14,\n",
       "             'shamelessly': 2,\n",
       "             'doubter': 2,\n",
       "             'plumage': 3,\n",
       "             'worming': 2,\n",
       "             'militiamen': 30,\n",
       "             'tombstones': 2,\n",
       "             'presupposable': 2,\n",
       "             'notable': 6,\n",
       "             'louise': 5,\n",
       "             'overtook': 17,\n",
       "             'abstraction': 8,\n",
       "             'displeased': 20,\n",
       "             'ranchmen': 2,\n",
       "             'instal': 2,\n",
       "             'kashmir': 3,\n",
       "             'nay': 4,\n",
       "             'wired': 5,\n",
       "             'pencil': 11,\n",
       "             'mustache': 46,\n",
       "             'breast': 87,\n",
       "             'dioxide': 9,\n",
       "             'disappointments': 4,\n",
       "             'impassive': 6,\n",
       "             'though': 651,\n",
       "             'floridas': 7,\n",
       "             'torban': 2,\n",
       "             'combine': 11,\n",
       "             'yawning': 7,\n",
       "             'homeless': 4,\n",
       "             'cinema': 2,\n",
       "             'subjects': 68,\n",
       "             'rib': 9,\n",
       "             'bin': 3,\n",
       "             'cylinders': 18,\n",
       "             'bijou': 2,\n",
       "             'acted': 38,\n",
       "             'accepted': 88,\n",
       "             'attainment': 11,\n",
       "             'mustered': 8,\n",
       "             'audacious': 2,\n",
       "             'respectable': 15,\n",
       "             'bilateral': 10,\n",
       "             'coraco': 2,\n",
       "             'stuffs': 2,\n",
       "             'reheat': 2,\n",
       "             'roberts': 3,\n",
       "             'trenton': 6,\n",
       "             'sharpening': 5,\n",
       "             'component': 6,\n",
       "             'pat': 4,\n",
       "             'animation': 32,\n",
       "             'coincidently': 5,\n",
       "             'cy': 2,\n",
       "             'smoker': 2,\n",
       "             'manes': 3,\n",
       "             'adelaide': 2,\n",
       "             'prayer': 43,\n",
       "             'industries': 65,\n",
       "             'advantageously': 5,\n",
       "             'dissolute': 3,\n",
       "             'tendon': 130,\n",
       "             'barton': 2,\n",
       "             'ablest': 2,\n",
       "             'episode': 12,\n",
       "             'barges': 3,\n",
       "             'sipping': 4,\n",
       "             'inoperative': 2,\n",
       "             'soap': 8,\n",
       "             'padlocks': 2,\n",
       "             'vagaries': 2,\n",
       "             'potemkins': 3,\n",
       "             'blackguard': 5,\n",
       "             'smashed': 11,\n",
       "             'bursitis': 17,\n",
       "             'goes': 61,\n",
       "             'prefix': 3,\n",
       "             'shops': 23,\n",
       "             'basketful': 2,\n",
       "             'stepfather': 22,\n",
       "             'veil': 17,\n",
       "             'adorers': 2,\n",
       "             'overhauled': 6,\n",
       "             'liquors': 3,\n",
       "             'bottoms': 3,\n",
       "             'plastun': 2,\n",
       "             'surest': 4,\n",
       "             'carlton': 5,\n",
       "             'friedland': 6,\n",
       "             'alice': 14,\n",
       "             'unhealthy': 15,\n",
       "             'cannula': 9,\n",
       "             'eleven': 22,\n",
       "             'persuasions': 3,\n",
       "             'cawolla': 2,\n",
       "             'elephants': 2,\n",
       "             'mechanicks': 2,\n",
       "             'kitten': 8,\n",
       "             'promotes': 2,\n",
       "             'venae': 2,\n",
       "             'matt': 2,\n",
       "             'private': 94,\n",
       "             'essential': 93,\n",
       "             'creating': 25,\n",
       "             'exclaiming': 5,\n",
       "             'extent': 100,\n",
       "             'oxidising': 2,\n",
       "             'dessicans': 3,\n",
       "             'uplands': 4,\n",
       "             'tops': 4,\n",
       "             'jerky': 6,\n",
       "             'irregularity': 6,\n",
       "             'recruitment': 3,\n",
       "             'fringes': 17,\n",
       "             'shopkeepers': 7,\n",
       "             'tendencies': 16,\n",
       "             'unconditionally': 3,\n",
       "             'brandy': 16,\n",
       "             'camberwell': 3,\n",
       "             'statue': 9,\n",
       "             'metatarsal': 9,\n",
       "             'measurement': 3,\n",
       "             'enclosures': 2,\n",
       "             'suspecting': 4,\n",
       "             'noses': 7,\n",
       "             'standard': 55,\n",
       "             'inspection': 19,\n",
       "             'enterprising': 6,\n",
       "             'freak': 4,\n",
       "             'liberating': 2,\n",
       "             'ordeal': 3,\n",
       "             'pancras': 2,\n",
       "             'luxury': 9,\n",
       "             'livery': 3,\n",
       "             'anconeus': 2,\n",
       "             'polypus': 4,\n",
       "             'leapt': 3,\n",
       "             'liberally': 2,\n",
       "             'finish': 50,\n",
       "             'previously': 56,\n",
       "             'mccarthy': 38,\n",
       "             'mallet': 6,\n",
       "             'bluestocking': 3,\n",
       "             'conveyance': 8,\n",
       "             'transformer': 2,\n",
       "             'compel': 10,\n",
       "             'blasphemies': 3,\n",
       "             'suggest': 25,\n",
       "             'shares': 4,\n",
       "             'dishonoured': 4,\n",
       "             'hen': 7,\n",
       "             'vols': 28,\n",
       "             'narcotisation': 2,\n",
       "             'speranski': 80,\n",
       "             'cherished': 15,\n",
       "             'overcoat': 27,\n",
       "             'malbrook': 2,\n",
       "             'nephroma': 2,\n",
       "             'habeus': 2,\n",
       "             'coward': 9,\n",
       "             'widower': 5,\n",
       "             'extremely': 52,\n",
       "             'resembling': 53,\n",
       "             'understood': 223,\n",
       "             'impetus': 10,\n",
       "             'actinomyces': 10,\n",
       "             'eosinophile': 4,\n",
       "             'pronounce': 10,\n",
       "             'arrangements': 30,\n",
       "             'inevitably': 33,\n",
       "             'hochgeboren': 2,\n",
       "             'crusted': 3,\n",
       "             'weeks': 118,\n",
       "             'slightest': 26,\n",
       "             'fords': 2,\n",
       "             'stimulatingly': 2,\n",
       "             'economically': 3,\n",
       "             'thrice': 9,\n",
       "             'peg': 5,\n",
       "             'adventurous': 4,\n",
       "             'mountainous': 3,\n",
       "             'potch': 2,\n",
       "             'adults': 27,\n",
       "             'kindled': 11,\n",
       "             'have': 3494,\n",
       "             'sedate': 3,\n",
       "             'democrats': 94,\n",
       "             'vaginitis': 2,\n",
       "             'foo': 2,\n",
       "             'headgear': 2,\n",
       "             'gape': 8,\n",
       "             'reassigned': 2,\n",
       "             'incompletely': 2,\n",
       "             'pharmacopoeial': 2,\n",
       "             'feelings': 79,\n",
       "             'phone': 3,\n",
       "             'anger': 60,\n",
       "             'improvisations': 2,\n",
       "             'dethrone': 2,\n",
       "             'toothed': 2,\n",
       "             'sweetish': 2,\n",
       "             'tack': 4,\n",
       "             'unwinding': 3,\n",
       "             'pediculosis': 2,\n",
       "             'overfed': 2,\n",
       "             'rabble': 8,\n",
       "             'opsonins': 4,\n",
       "             'ver': 3,\n",
       "             'postures': 3,\n",
       "             'entertainment': 8,\n",
       "             'unkind': 5,\n",
       "             'lightest': 3,\n",
       "             'undergone': 10,\n",
       "             'persons': 120,\n",
       "             'mention': 47,\n",
       "             'iodism': 2,\n",
       "             'sterne': 2,\n",
       "             'kolocha': 17,\n",
       "             'wecollect': 2,\n",
       "             'asked': 778,\n",
       "             'augury': 2,\n",
       "             'uhlans': 29,\n",
       "             'fist': 9,\n",
       "             'winner': 3,\n",
       "             'praise': 17,\n",
       "             'legislative': 32,\n",
       "             'greediness': 3,\n",
       "             'resembled': 9,\n",
       "             'expense': 34,\n",
       "             'despotic': 2,\n",
       "             'storage': 4,\n",
       "             'league': 54,\n",
       "             'granular': 8,\n",
       "             'differed': 11,\n",
       "             'enlistment': 2,\n",
       "             'authorizing': 17,\n",
       "             'remembers': 7,\n",
       "             'outlays': 4,\n",
       "             'malignant': 89,\n",
       "             'allowed': 87,\n",
       "             'rang': 30,\n",
       "             'wing': 33,\n",
       "             'sped': 2,\n",
       "             'map': 39,\n",
       "             'might': 537,\n",
       "             'disrupting': 2,\n",
       "             'pyrexia': 7,\n",
       "             'besprinkled': 3,\n",
       "             'reddaway': 3,\n",
       "             'intrusions': 2,\n",
       "             'respectively': 11,\n",
       "             'ilagin': 26,\n",
       "             'portray': 2,\n",
       "             'shenkii': 2,\n",
       "             'drilled': 5,\n",
       "             'devilish': 4,\n",
       "             'abate': 6,\n",
       "             'trophic': 21,\n",
       "             'soil': 94,\n",
       "             'smelters': 3,\n",
       "             'people': 900,\n",
       "             'minerals': 9,\n",
       "             'site': 33,\n",
       "             'ceremony': 18,\n",
       "             'rostov': 777,\n",
       "             'occupation': 53,\n",
       "             'whale': 4,\n",
       "             'definitely': 36,\n",
       "             'steam': 31,\n",
       "             'turbinate': 2,\n",
       "             'recollection': 24,\n",
       "             'sports': 3,\n",
       "             'apportioned': 12,\n",
       "             'ripon': 2,\n",
       "             'representations': 4,\n",
       "             'calcanean': 3,\n",
       "             'demonetized': 2,\n",
       "             'cloaks': 10,\n",
       "             'river': 113,\n",
       "             'industrial': 99,\n",
       "             'foreman': 6,\n",
       "             'girt': 4,\n",
       "             'close': 220,\n",
       "             'warne': 2,\n",
       "             'blubberers': 2,\n",
       "             'disruption': 5,\n",
       "             'diligently': 4,\n",
       "             'blamelessly': 2,\n",
       "             'cornwall': 3,\n",
       "             'edinburgh': 26,\n",
       "             'denouncing': 5,\n",
       "             'sasha': 3,\n",
       "             'zeal': 26,\n",
       "             'shouted': 255,\n",
       "             'powerlessness': 2,\n",
       "             'helpless': 23,\n",
       "             'cheapest': 3,\n",
       "             'homely': 9,\n",
       "             'pyosalpynx': 2,\n",
       "             'poorhouse': 2,\n",
       "             'adhesive': 4,\n",
       "             'lumber': 15,\n",
       "             'cawing': 3,\n",
       "             'splashed': 7,\n",
       "             'interphalangeal': 3,\n",
       "             'maintenance': 11,\n",
       "             'pervade': 2,\n",
       "             'beads': 5,\n",
       "             'scarcity': 5,\n",
       "             'anticipate': 5,\n",
       "             'orthodox': 10,\n",
       "             'sponge': 8,\n",
       "             'infer': 4,\n",
       "             'inverted': 4,\n",
       "             'bleak': 5,\n",
       "             'main': 110,\n",
       "             'doddering': 2,\n",
       "             'ligation': 53,\n",
       "             'shiloh': 3,\n",
       "             'haitian': 2,\n",
       "             'hindrance': 12,\n",
       "             'amputate': 5,\n",
       "             'obstructed': 11,\n",
       "             'extensively': 9,\n",
       "             'raised': 213,\n",
       "             'visualized': 2,\n",
       "             'showed': 150,\n",
       "             'awaits': 8,\n",
       "             'specified': 10,\n",
       "             'gush': 2,\n",
       "             'partially': 11,\n",
       "             'calibres': 2,\n",
       "             'kettles': 2,\n",
       "             'easy': 123,\n",
       "             'confirming': 4,\n",
       "             'hordes': 4,\n",
       "             'saved': 52,\n",
       "             'sneered': 3,\n",
       "             'peevishly': 2,\n",
       "             'unexpectedly': 58,\n",
       "             'handedness': 2,\n",
       "             'gigantic': 24,\n",
       "             'bouts': 3,\n",
       "             'hoosier': 2,\n",
       "             'plural': 4,\n",
       "             'disuse': 6,\n",
       "             'flora': 10,\n",
       "             'wriggles': 2,\n",
       "             'reproving': 2,\n",
       "             'unrepaired': 2,\n",
       "             'repose': 8,\n",
       "             'baltic': 2,\n",
       "             'whistle': 29,\n",
       "             'barbarous': 3,\n",
       "             'warp': 3,\n",
       "             'vozdvizhenka': 8,\n",
       "             'intubation': 7,\n",
       "             'dreaminess': 2,\n",
       "             'creek': 4,\n",
       "             'suffrage': 116,\n",
       "             'gay': 36,\n",
       "             'bagovut': 3,\n",
       "             'dropping': 19,\n",
       "             'poultry': 3,\n",
       "             'sentiments': 15,\n",
       "             'bivouacs': 4,\n",
       "             'pobox': 5,\n",
       "             'fuller': 13,\n",
       "             'bombard': 2,\n",
       "             'vanishes': 4,\n",
       "             'independent': 74,\n",
       "             'card': 31,\n",
       "             'sheen': 2,\n",
       "             'descends': 3,\n",
       "             'cerebritis': 2,\n",
       "             'sparrows': 2,\n",
       "             'stab': 9,\n",
       "             'coverlet': 2,\n",
       "             'footmarks': 4,\n",
       "             'homicidal': 3,\n",
       "             'frustration': 2,\n",
       "             'sicca': 4,\n",
       "             'slowly': 134,\n",
       "             'stifling': 3,\n",
       "             'poetry': 11,\n",
       "             'inadvertent': 2,\n",
       "             'faints': 2,\n",
       "             'tudor': 2,\n",
       "             'hancocks': 2,\n",
       "             'postulated': 2,\n",
       "             'delegate': 10,\n",
       "             'ratify': 8,\n",
       "             'rag': 7,\n",
       "             'yankovo': 4,\n",
       "             'staked': 6,\n",
       "             'leprosy': 5,\n",
       "             'cheyenne': 2,\n",
       "             'verbally': 2,\n",
       "             're': 190,\n",
       "             'pedunculated': 14,\n",
       "             'splendidly': 12,\n",
       "             'troubled': 18,\n",
       "             'healed': 17,\n",
       "             'tuning': 2,\n",
       "             'farmer': 28,\n",
       "             'bourbon': 3,\n",
       "             'differently': 26,\n",
       "             'descry': 2,\n",
       "             'filter': 5,\n",
       "             'sob': 20,\n",
       "             'newfoundland': 2,\n",
       "             'flattered': 18,\n",
       "             'lotions': 9,\n",
       "             'refinements': 6,\n",
       "             'overtake': 13,\n",
       "             'offerings': 3,\n",
       "             'kent': 6,\n",
       "             'pimple': 2,\n",
       "             'carousals': 4,\n",
       "             'intrigue': 12,\n",
       "             'salut': 2,\n",
       "             'pliant': 2,\n",
       "             'zum': 2,\n",
       "             'saw': 600,\n",
       "             'load': 9,\n",
       "             'engineers': 5,\n",
       "             'imports': 12,\n",
       "             'indifference': 27,\n",
       "             'are': 3631,\n",
       "             'wringing': 3,\n",
       "             'kidney': 22,\n",
       "             'nun': 4,\n",
       "             'wool': 43,\n",
       "             'utero': 2,\n",
       "             'tag': 3,\n",
       "             'hamlet': 2,\n",
       "             'attain': 51,\n",
       "             'ingested': 2,\n",
       "             'longitudinal': 6,\n",
       "             'detestable': 4,\n",
       "             'inspector': 32,\n",
       "             'hernia': 14,\n",
       "             'circulates': 2,\n",
       "             'adieu': 7,\n",
       "             'annulment': 4,\n",
       "             'tumour': 224,\n",
       "             'elect': 13,\n",
       "             'sidorov': 5,\n",
       "             'eats': 9,\n",
       "             'bond': 27,\n",
       "             'messages': 7,\n",
       "             'surgeon': 44,\n",
       "             'dissecting': 6,\n",
       "             'clarifying': 2,\n",
       "             'stout': 63,\n",
       "             'wrung': 18,\n",
       "             'sincere': 23,\n",
       "             'reverie': 8,\n",
       "             'stampede': 3,\n",
       "             'vindicate': 2,\n",
       "             'cartload': 2,\n",
       "             'semiopen': 2,\n",
       "             'gainful': 2,\n",
       "             'noting': 11,\n",
       "             'marsh': 4,\n",
       "             'interfering': 20,\n",
       "             'sevres': 3,\n",
       "             'charmee': 2,\n",
       "             'cherish': 7,\n",
       "             'beseech': 4,\n",
       "             'transylvania': 4,\n",
       "             'new': 1212,\n",
       "             'weason': 2,\n",
       "             'frontiersmen': 6,\n",
       "             'feverish': 25,\n",
       "             'establish': 46,\n",
       "             'grassy': 2,\n",
       "             'assent': 10,\n",
       "             'muscularly': 4,\n",
       "             'rire': 2,\n",
       "             'mcmaster': 11,\n",
       "             'pineapples': 3,\n",
       "             'maritime': 6,\n",
       "             'debauchery': 6,\n",
       "             'disliked': 21,\n",
       "             'relationships': 3,\n",
       "             'swarmed': 5,\n",
       "             'surpassed': 4,\n",
       "             'discernible': 4,\n",
       "             'thyreoid': 22,\n",
       "             'paddle': 4,\n",
       "             'zoology': 4,\n",
       "             'tenn': 4,\n",
       "             'wood': 89,\n",
       "             'persuasiveness': 3,\n",
       "             'bashful': 3,\n",
       "             'uterine': 12,\n",
       "             'convention': 152,\n",
       "             'unshapely': 2,\n",
       "             'homes': 36,\n",
       "             'bicipital': 3,\n",
       "             'admirable': 15,\n",
       "             'nightshirt': 2,\n",
       "             'fibrin': 17,\n",
       "             'bartenstein': 3,\n",
       "             'guess': 18,\n",
       "             'circlet': 2,\n",
       "             'adventitious': 9,\n",
       "             'indemnify': 6,\n",
       "             'typewriting': 5,\n",
       "             'expunging': 2,\n",
       "             'quand': 6,\n",
       "             'exactions': 2,\n",
       "             'receiving': 55,\n",
       "             'incur': 3,\n",
       "             'giants': 3,\n",
       "             'tearing': 23,\n",
       "             'probing': 4,\n",
       "             'devise': 10,\n",
       "             'hearth': 2,\n",
       "             'placental': 2,\n",
       "             'hammering': 6,\n",
       "             'defeating': 3,\n",
       "             'womanly': 8,\n",
       "             'jewellery': 3,\n",
       "             'kuragins': 4,\n",
       "             'target': 6,\n",
       "             'stevedore': 2,\n",
       "             'safest': 3,\n",
       "             'sounder': 3,\n",
       "             'likes': 9,\n",
       "             'appointments': 12,\n",
       "             'speech': 83,\n",
       "             'quaker': 3,\n",
       "             'antonov': 2,\n",
       "             'proofs': 15,\n",
       "             'boasting': 6,\n",
       "             'wegiment': 3,\n",
       "             'disciplined': 4,\n",
       "             'occupancy': 2,\n",
       "             'flare': 2,\n",
       "             'copenhagen': 3,\n",
       "             'zenger': 4,\n",
       "             'battalions': 27,\n",
       "             'truth': 116,\n",
       "             'research': 37,\n",
       "             'recurvatum': 2,\n",
       "             'notion': 17,\n",
       "             'dishonour': 2,\n",
       "             'glittered': 17,\n",
       "             'kittenish': 2,\n",
       "             'handed': 81,\n",
       "             'acquiescence': 2,\n",
       "             'waggled': 2,\n",
       "             'forged': 5,\n",
       "             'unjust': 11,\n",
       "             'callous': 13,\n",
       "             'mantles': 2,\n",
       "             'election': 120,\n",
       "             'drummed': 2,\n",
       "             'salve': 2,\n",
       "             'agrees': 3,\n",
       "             'discharging': 5,\n",
       "             'cannons': 2,\n",
       "             'impure': 3,\n",
       "             'probable': 28,\n",
       "             'administration': 118,\n",
       "             'considerable': 174,\n",
       "             'forwards': 19,\n",
       "             'waterloo': 10,\n",
       "             'flail': 5,\n",
       "             'claiming': 7,\n",
       "             'phalanges': 12,\n",
       "             'bondage': 16,\n",
       "             'pelageya': 17,\n",
       "             'bigger': 8,\n",
       "             'southern': 197,\n",
       "             'perch': 5,\n",
       "             'enriched': 4,\n",
       "             'metaphysis': 3,\n",
       "             'protection': 64,\n",
       "             'factotum': 2,\n",
       "             'cavalryman': 6,\n",
       "             'radiated': 6,\n",
       "             'cheerfully': 14,\n",
       "             'solid': 42,\n",
       "             'vines': 2,\n",
       "             'scarves': 4,\n",
       "             'quick': 82,\n",
       "             'notabilities': 4,\n",
       "             'him': 5231,\n",
       "             'sixteenth': 12,\n",
       "             'ignoring': 2,\n",
       "             'deserters': 2,\n",
       "             'protege': 5,\n",
       "             'indulgence': 10,\n",
       "             'supreme': 75,\n",
       "             'closes': 7,\n",
       "             'shilling': 5,\n",
       "             'footing': 16,\n",
       "             'mission': 35,\n",
       "             'madame': 44,\n",
       "             'dissatisfied': 35,\n",
       "             'signatures': 4,\n",
       "             'helps': 6,\n",
       "             'garlic': 2,\n",
       "             'wart': 6,\n",
       "             'won': 202,\n",
       "             'overlooked': 14,\n",
       "             'lanfrey': 2,\n",
       "             'dulness': 2,\n",
       "             'unnatural': 38,\n",
       "             'supplier': 3,\n",
       "             'harassed': 4,\n",
       "             'hare': 37,\n",
       "             'slide': 5,\n",
       "             'necessitates': 6,\n",
       "             'conceived': 8,\n",
       "             'mode': 19,\n",
       "             'chant': 2,\n",
       "             'packing': 35,\n",
       "             'tentacles': 2,\n",
       "             'liberality': 2,\n",
       "             'phantasm': 2,\n",
       "             'gloat': 3,\n",
       "             'promptitude': 2,\n",
       "             'merchants': 55,\n",
       "             'whatnots': 2,\n",
       "             'spirited': 13,\n",
       "             'rupia': 5,\n",
       "             'succumbs': 2,\n",
       "             'fondest': 3,\n",
       "             'rusty': 5,\n",
       "             'strapped': 2,\n",
       "             'looking': 490,\n",
       "             'numerical': 6,\n",
       "             'jaws': 19,\n",
       "             'mann': 2,\n",
       "             'smelter': 2,\n",
       "             'becher': 4,\n",
       "             'comprehensive': 5,\n",
       "             'vessel': 138,\n",
       "             'code': 14,\n",
       "             'twopence': 3,\n",
       "             'semilunar': 9,\n",
       "             'elected': 45,\n",
       "             'tone': 167,\n",
       "             'epithelium': 56,\n",
       "             'steve': 2,\n",
       "             'pinckney': 4,\n",
       "             'knapsack': 8,\n",
       "             'kneeling': 9,\n",
       "             'strand': 6,\n",
       "             'solitude': 19,\n",
       "             'gentlemanly': 2,\n",
       "             'thoughts': 126,\n",
       "             'castanet': 2,\n",
       "             'bushy': 10,\n",
       "             'descried': 4,\n",
       "             'aponeurotic': 2,\n",
       "             'surrenders': 2,\n",
       "             'ordered': 149,\n",
       "             'emancipation': 24,\n",
       "             'alley': 6,\n",
       "             'blazers': 3,\n",
       "             'servants': 89,\n",
       "             'rests': 18,\n",
       "             'tooth': 28,\n",
       "             'risks': 15,\n",
       "             'yoke': 6,\n",
       "             'inflammation': 94,\n",
       "             'blonde': 10,\n",
       "             'album': 6,\n",
       "             'duc': 12,\n",
       "             'thorns': 5,\n",
       "             'planter': 19,\n",
       "             'log': 21,\n",
       "             'swarm': 8,\n",
       "             'trocar': 4,\n",
       "             'injured': 56,\n",
       "             'liquor': 14,\n",
       "             'perforations': 7,\n",
       "             'censuring': 2,\n",
       "             'contracture': 29,\n",
       "             'informer': 2,\n",
       "             'switzerland': 9,\n",
       "             'ponies': 2,\n",
       "             'glass': 117,\n",
       "             'burdensome': 5,\n",
       "             'security': 22,\n",
       "             'bitch': 13,\n",
       "             'bacillary': 5,\n",
       "             'transmuted': 2,\n",
       "             'atrocious': 3,\n",
       "             'elects': 2,\n",
       "             'but': 5654,\n",
       "             'muir': 3,\n",
       "             'nikita': 5,\n",
       "             'muddled': 4,\n",
       "             'lifts': 3,\n",
       "             'impresses': 2,\n",
       "             'slim': 14,\n",
       "             'maksim': 2,\n",
       "             'garrulously': 2,\n",
       "             'utterances': 2,\n",
       "             'stammered': 6,\n",
       "             'midsummer': 2,\n",
       "             'trousseau': 5,\n",
       "             'hogs': 3,\n",
       "             'metacarpals': 3,\n",
       "             'blindfold': 5,\n",
       "             'nonmoral': 2,\n",
       "             'moonlight': 19,\n",
       "             'waddling': 6,\n",
       "             'pointing': 89,\n",
       "             'differentiating': 4,\n",
       "             'silly': 14,\n",
       "             'ingest': 2,\n",
       "             'imply': 13,\n",
       "             'enlarges': 3,\n",
       "             'cart': 57,\n",
       "             'differ': 16,\n",
       "             'inflamed': 54,\n",
       "             'bolding': 2,\n",
       "             'shave': 4,\n",
       "             'adroitly': 4,\n",
       "             'served': 64,\n",
       "             'straining': 19,\n",
       "             'fall': 125,\n",
       "             'autocrats': 3,\n",
       "             'unreasonably': 2,\n",
       "             'salter': 2,\n",
       "             'ramshackle': 2,\n",
       "             'eminent': 13,\n",
       "             'molle': 2,\n",
       "             'pained': 8,\n",
       "             'loyal': 21,\n",
       "             'chalk': 14,\n",
       "             'willard': 3,\n",
       "             'idleness': 13,\n",
       "             'canceled': 4,\n",
       "             'inoculation': 23,\n",
       "             'sounding': 7,\n",
       "             'loopholes': 2,\n",
       "             'initial': 20,\n",
       "             'vicar': 3,\n",
       "             'oklahoma': 16,\n",
       "             'precautions': 15,\n",
       "             'commensurable': 2,\n",
       "             'batch': 2,\n",
       "             'push': 20,\n",
       "             'sublimed': 2,\n",
       "             'warned': 19,\n",
       "             'meuse': 4,\n",
       "             'snoring': 7,\n",
       "             'manner': 136,\n",
       "             'this': 4064,\n",
       "             'household': 56,\n",
       "             'millionaires': 6,\n",
       "             'duodenum': 3,\n",
       "             'dixon': 2,\n",
       "             'liberated': 11,\n",
       "             'arrangement': 36,\n",
       "             'soulless': 2,\n",
       "             'reserved': 16,\n",
       "             'essen': 2,\n",
       "             'whirlpool': 2,\n",
       "             'duport': 9,\n",
       "             'hellish': 4,\n",
       "             'engaged': 88,\n",
       "             'apropos': 2,\n",
       "             'retuned': 2,\n",
       "             'cancellation': 2,\n",
       "             'crack': 21,\n",
       "             'morrant': 2,\n",
       "             'pleasurable': 2,\n",
       "             'sunk': 28,\n",
       "             'indorsed': 7,\n",
       "             'wanted': 214,\n",
       "             'integration': 3,\n",
       "             'translations': 2,\n",
       "             'framework': 24,\n",
       "             'skill': 35,\n",
       "             'upper': 131,\n",
       "             'mufti': 2,\n",
       "             'softened': 29,\n",
       "             'callosity': 5,\n",
       "             'thrombosis': 40,\n",
       "             'septum': 4,\n",
       "             'sont': 3,\n",
       "             'sheaf': 2,\n",
       "             'redistribution': 6,\n",
       "             'clymer': 2,\n",
       "             'fatal': 64,\n",
       "             'laid': 187,\n",
       "             'nares': 2,\n",
       "             'archway': 2,\n",
       "             'deterred': 2,\n",
       "             'oscillates': 2,\n",
       "             'ineffective': 2,\n",
       "             'vacant': 11,\n",
       "             'confide': 7,\n",
       "             'nominal': 9,\n",
       "             'hiram': 2,\n",
       "             'resected': 12,\n",
       "             'unwanted': 2,\n",
       "             'pattern': 8,\n",
       "             'cicatricial': 42,\n",
       "             'bowl': 7,\n",
       "             'nor': 281,\n",
       "             'wobbly': 3,\n",
       "             'sarcomas': 9,\n",
       "             'solder': 2,\n",
       "             'laceration': 10,\n",
       "             'illustrating': 6,\n",
       "             'psammoma': 3,\n",
       "             'stuck': 21,\n",
       "             'perversion': 2,\n",
       "             'jewels': 6,\n",
       "             'country': 424,\n",
       "             'enforce': 30,\n",
       "             'battlefields': 2,\n",
       "             'speechless': 2,\n",
       "             'ruin': 49,\n",
       "             'grassland': 2,\n",
       "             'mistrusting': 2,\n",
       "             'bench': 27,\n",
       "             'scurrying': 2,\n",
       "             'rhetor': 25,\n",
       "             'morton': 2,\n",
       "             'suggestiveness': 2,\n",
       "             'remorseless': 2,\n",
       "             'divert': 7,\n",
       "             'melyukovs': 8,\n",
       "             'obviates': 2,\n",
       "             'cooperative': 10,\n",
       "             'started': 97,\n",
       "             'employe': 3,\n",
       "             'perfidiousness': 2,\n",
       "             'replied': 321,\n",
       "             'imagined': 49,\n",
       "             'marvel': 2,\n",
       "             'pitiable': 8,\n",
       "             'genteel': 2,\n",
       "             'unlocked': 9,\n",
       "             'independence': 152,\n",
       "             'appealed': 14,\n",
       "             'flexible': 6,\n",
       "             'xxiv': 9,\n",
       "             'authority': 101,\n",
       "             'frilled': 2,\n",
       "             'thoroughbred': 6,\n",
       "             'accuser': 2,\n",
       "             'hesitating': 17,\n",
       "             'pork': 4,\n",
       "             'voltaires': 2,\n",
       "             'bengal': 3,\n",
       "             'tends': 49,\n",
       "             'wastage': 2,\n",
       "             'shrill': 14,\n",
       "             'profunda': 4,\n",
       "             'thanksgiving': 7,\n",
       "             'enclose': 2,\n",
       "             'seedy': 3,\n",
       "             'furs': 11,\n",
       "             'splash': 5,\n",
       "             'supplementary': 4,\n",
       "             'sedately': 3,\n",
       "             'certainly': 120,\n",
       "             'puppet': 4,\n",
       "             'injustice': 14,\n",
       "             'lanolin': 4,\n",
       "             'thicknesses': 2,\n",
       "             'excuses': 7,\n",
       "             'faces': 163,\n",
       "             'fourteenth': 28,\n",
       "             'trimming': 2,\n",
       "             'prick': 6,\n",
       "             'economic': 121,\n",
       "             'stands': 20,\n",
       "             'programming': 2,\n",
       "             'princes': 12,\n",
       "             'lottery': 2,\n",
       "             'exsiccated': 2,\n",
       "             'parceled': 2,\n",
       "             'musculo': 9,\n",
       "             'standstill': 3,\n",
       "             'consolation': 19,\n",
       "             'uncle': 136,\n",
       "             'federalism': 2,\n",
       "             'intravenous': 4,\n",
       "             'assumption': 32,\n",
       "             'doors': 48,\n",
       "             'lisa': 2,\n",
       "             'disregard': 7,\n",
       "             'appendix': 12,\n",
       "             'myoma': 9,\n",
       "             'cost': 66,\n",
       "             'parrot': 6,\n",
       "             'manoeuvre': 2,\n",
       "             'achtung': 2,\n",
       "             'acorn': 3,\n",
       "             'aggrieved': 4,\n",
       "             'cutlery': 2,\n",
       "             'glorious': 12,\n",
       "             'rotten': 9,\n",
       "             'denoted': 2,\n",
       "             'muster': 4,\n",
       "             'hug': 2,\n",
       "             'eccentricity': 3,\n",
       "             'susquehanna': 4,\n",
       "             'partake': 4,\n",
       "             'nicely': 10,\n",
       "             'thing': 304,\n",
       "             'wages': 51,\n",
       "             'dislike': 10,\n",
       "             'beams': 10,\n",
       "             'spree': 3,\n",
       "             'antagonizing': 2,\n",
       "             'advises': 4,\n",
       "             'snuffling': 2,\n",
       "             'rods': 6,\n",
       "             'borisov': 2,\n",
       "             'mayor': 16,\n",
       "             'mathematics': 12,\n",
       "             'invent': 11,\n",
       "             'teaching': 19,\n",
       "             'girdle': 4,\n",
       "             'averting': 4,\n",
       "             'elijah': 2,\n",
       "             'platelets': 2,\n",
       "             'uvula': 2,\n",
       "             'mumbling': 5,\n",
       "             'retreat': 95,\n",
       "             'fwashing': 2,\n",
       "             'bar': 26,\n",
       "             'scudding': 2,\n",
       "             'nowadays': 19,\n",
       "             'loftiness': 3,\n",
       "             'ceded': 9,\n",
       "             'delirium': 35,\n",
       "             'wiring': 3,\n",
       "             'centre': 51,\n",
       "             'frightfully': 2,\n",
       "             'wag': 6,\n",
       "             'sockets': 2,\n",
       "             'fluctuates': 2,\n",
       "             'concealment': 4,\n",
       "             'intima': 11,\n",
       "             'burnt': 9,\n",
       "             'tiding': 3,\n",
       "             'osteomas': 4,\n",
       "             'tug': 5,\n",
       "             'vis': 2,\n",
       "             'fabrication': 2,\n",
       "             'powers': 150,\n",
       "             'sweeps': 2,\n",
       "             'supervene': 7,\n",
       "             'meal': 16,\n",
       "             'briskly': 20,\n",
       "             'reinforce': 3,\n",
       "             'devriez': 2,\n",
       "             'youngster': 5,\n",
       "             'coast': 41,\n",
       "             'sameness': 2,\n",
       "             'about': 1498,\n",
       "             'whither': 9,\n",
       "             'tolstoy': 14,\n",
       "             'certain': 362,\n",
       "             'paunch': 2,\n",
       "             'laurel': 4,\n",
       "             'stamping': 4,\n",
       "             'incorporate': 4,\n",
       "             ...})"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "NWORDS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  编辑距离: ###\n",
    "两个词之间的编辑距离定义为使用了几次插入(在词中插入一个单字母), 删除(删除一个单字母), 交换(交换相邻两个字母), 替换(把一个字母换成另一个)的操作从一个词变到另一个词."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#返回所有与单词 w 编辑距离为 1 的集合\n",
    "def edits1(word):\n",
    "    n = len(word)\n",
    "    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion\n",
    "               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition\n",
    "               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration\n",
    "               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与 something 编辑距离为2的单词居然达到了 114,324 个\n",
    "\n",
    "优化:在这些编辑距离小于2的词中间, 只把那些正确的词作为候选词,只能返回 3 个单词: ‘smoothing’, ‘something’ 和 ‘soothing’"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#返回所有与单词 w 编辑距离为 2 的集合\n",
    "#在这些编辑距离小于2的词中间, 只把那些正确的词作为候选词\n",
    "def edits2(word):\n",
    "    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "正常来说把一个元音拼成另一个的概率要大于辅音 (因为人常常把 hello 打成 hallo 这样); 把单词的第一个字母拼错的概率会相对小, 等等.但是为了简单起见, 选择了一个简单的方法: 编辑距离为1的正确单词比编辑距离为2的优先级高, 而编辑距离为0的正确单词优先级比编辑距离为1的高. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def known(words): return set(w for w in words if w in NWORDS)\n",
    "\n",
    "#如果known(set)非空, candidate 就会选取这个集合, 而不继续计算后面的\n",
    "def correct(word):\n",
    "    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]\n",
    "    return max(candidates, key=lambda w: NWORDS[w])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
